Problem 1

1.1

Read the training data and the pre-calculated tangent distance matrix:

load("zip.RData")
load("zip.dist.RData")

Perform nonmetric MDS (isoMDS) on the tangent distance matrix to reduce the original data to a 2D space:

library(MASS)
dat.mds <- isoMDS(zip.dist, k=2)
## initial  value 34.218873 
## final  value 34.218662 
## converged
library(ggplot2)
ggplot(as.data.frame(dat.mds$points), aes(V1, V2, color=dat$train[,1])) +
  geom_point() + coord_fixed() + labs(color="digit")

The digits are well separated, better than with PCA (which is equivalent to MDS with Euclidean distance).
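As a side note on what isoMDS returns: it takes any dissimilarity matrix and produces a k-dimensional configuration plus a stress value (the "value" lines printed above), which measures how faithfully the embedding preserves the input distances. A minimal sketch on random toy data, not the zip digits:

```r
# Toy sketch (random data, not the zip digits): isoMDS accepts a
# distance object and returns the embedded coordinates and the
# final stress (percent; lower means better distance preservation).
library(MASS)
set.seed(1)
x.toy <- matrix(rnorm(30 * 4), nrow = 30)   # 30 points in 4 dimensions
fit <- isoMDS(dist(x.toy), k = 2, trace = FALSE)
dim(fit$points)   # 30 x 2 embedded coordinates
fit$stress        # final stress of the configuration
```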

Problem 2

2.1

Subsample 1000 digits for Isomap.

set.seed(10)
ind <- sample(1:nrow(dat$train), 1000)
x <- as.matrix(dat$train[ind, -1])
y <- dat$train[ind, 1]

Perform Isomap with k=5 to find a 2D embedding:

library(RDRToolbox)
dat.isomap <- Isomap(data=x, dims=2, k=5)
## Computing distance matrix ... done
## Building graph with shortest paths (using 5 nearest neighbours) ... done
## Computing low dimensional embedding ... done
## number of samples: 1000
## reduction from 256 to 2 dimensions
## number of connected components in graph: 1
ggplot(as.data.frame(dat.isomap$dim2), aes(V1, V2, color=y)) +
  geom_point() + coord_fixed() + labs(color="digit")

2.2

Try different parameter values (k) for Isomap:

k.vals <- c(2,3,5,10,25,50)
dat.isomap <- list()
for (i in 1:length(k.vals)) {
  dat.isomap[[i]] <- Isomap(data=x, dims=2, k=k.vals[i])$dim2
}
## Computing distance matrix ... done
## Building graph with shortest paths (using 2 nearest neighbours) ... done
## Computing low dimensional embedding ... done
## number of samples: 1000
## reduction from 256 to 2 dimensions
## number of connected components in graph: 2
## Computing distance matrix ... done
## Building graph with shortest paths (using 3 nearest neighbours) ... done
## Computing low dimensional embedding ... done
## number of samples: 1000
## reduction from 256 to 2 dimensions
## number of connected components in graph: 1
## Computing distance matrix ... done
## Building graph with shortest paths (using 5 nearest neighbours) ... done
## Computing low dimensional embedding ... done
## number of samples: 1000
## reduction from 256 to 2 dimensions
## number of connected components in graph: 1
## Computing distance matrix ... done
## Building graph with shortest paths (using 10 nearest neighbours) ... done
## Computing low dimensional embedding ... done
## number of samples: 1000
## reduction from 256 to 2 dimensions
## number of connected components in graph: 1
## Computing distance matrix ... done
## Building graph with shortest paths (using 25 nearest neighbours) ... done
## Computing low dimensional embedding ... done
## number of samples: 1000
## reduction from 256 to 2 dimensions
## number of connected components in graph: 1
## Computing distance matrix ... done
## Building graph with shortest paths (using 50 nearest neighbours) ... done
## Computing low dimensional embedding ... done
## number of samples: 1000
## reduction from 256 to 2 dimensions
## number of connected components in graph: 1
library(gridExtra)
p <- list()
for(i in 1:length(k.vals)) {
    p[[i]] <- ggplot(as.data.frame(dat.isomap[[i]]), aes(V1, V2, color=y)) +
      geom_point(size=.1) + coord_fixed() + guides(color=guide_legend(title="digit")) +
      ggtitle(paste('k =', k.vals[i])) +
      theme(legend.position="none", plot.title=element_text(size=10), axis.text.x=element_text(size=8),
            axis.text.y=element_text(size=8), axis.title=element_blank())
}
do.call(grid.arrange, c(p, ncol=3))

The results are sensitive to k; note that with k = 2 the neighborhood graph is no longer connected (the output above reports two components).
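A minimal base-R sketch (toy data and a hypothetical helper knn.components, not the digit sample) of why small k is risky: Isomap builds a symmetrized kNN graph, and with too few neighbors that graph can disconnect, exactly as the k = 2 run above reports.

```r
# Count connected components of the symmetrized kNN graph,
# mimicking the graph-building step of Isomap on toy data.
knn.components <- function(x, k) {
  d <- as.matrix(dist(x))
  n <- nrow(d)
  adj <- matrix(FALSE, n, n)
  for (i in 1:n) {
    adj[i, order(d[i, ])[2:(k + 1)]] <- TRUE  # k nearest, skipping self
  }
  adj <- adj | t(adj)   # symmetrize the neighbor relation
  comp <- rep(0L, n)
  ncomp <- 0L
  for (s in 1:n) {
    if (comp[s] == 0L) {          # start a new component
      ncomp <- ncomp + 1L
      frontier <- s
      while (length(frontier) > 0) {
        comp[frontier] <- ncomp
        nbrs <- which(colSums(adj[frontier, , drop = FALSE]) > 0)
        frontier <- nbrs[comp[nbrs] == 0L]
      }
    }
  }
  ncomp
}
# Two well-separated clusters: small k leaves the graph disconnected,
# large k connects everything into a single component.
set.seed(1)
x.toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
               matrix(rnorm(40, mean = 10), ncol = 2))
knn.components(x.toy, k = 2)
knn.components(x.toy, k = 25)
```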

Problem 3

3.1

Apply tSNE with perplexity = 25 to find a 2D embedding of digits data:

library(Rtsne)
set.seed(10)
dat.tsne <- Rtsne(dat$train[,-1], dims=2, perplexity=25, verbose=T)
## Read the 2219 x 50 data matrix successfully!
## Using no_dims = 2, perplexity = 25.000000, and theta = 0.500000
## Computing input similarities...
## Normalizing input...
## Building tree...
##  - point 0 of 2219
## Done in 0.29 seconds (sparsity = 0.046597)!
## Learning embedding...
## Iteration 50: error is 79.918394 (50 iterations in 0.89 seconds)
## Iteration 100: error is 72.216926 (50 iterations in 0.82 seconds)
## Iteration 150: error is 71.396813 (50 iterations in 0.78 seconds)
## Iteration 200: error is 71.113253 (50 iterations in 0.79 seconds)
## Iteration 250: error is 70.967667 (50 iterations in 0.78 seconds)
## Iteration 300: error is 1.925602 (50 iterations in 0.71 seconds)
## Iteration 350: error is 1.638494 (50 iterations in 0.68 seconds)
## Iteration 400: error is 1.508843 (50 iterations in 0.70 seconds)
## Iteration 450: error is 1.436164 (50 iterations in 0.72 seconds)
## Iteration 500: error is 1.400877 (50 iterations in 0.72 seconds)
## Iteration 550: error is 1.380000 (50 iterations in 0.74 seconds)
## Iteration 600: error is 1.365542 (50 iterations in 0.75 seconds)
## Iteration 650: error is 1.354502 (50 iterations in 0.75 seconds)
## Iteration 700: error is 1.345434 (50 iterations in 0.75 seconds)
## Iteration 750: error is 1.338209 (50 iterations in 0.74 seconds)
## Iteration 800: error is 1.332063 (50 iterations in 0.75 seconds)
## Iteration 850: error is 1.326741 (50 iterations in 0.77 seconds)
## Iteration 900: error is 1.322192 (50 iterations in 0.76 seconds)
## Iteration 950: error is 1.318307 (50 iterations in 0.77 seconds)
## Iteration 1000: error is 1.314517 (50 iterations in 0.76 seconds)
## Fitting performed in 15.14 seconds.
ggplot(as.data.frame(dat.tsne$Y), aes(V1, V2, color=dat$train[,1])) +
  geom_point() + coord_fixed() + labs(color="digit")

3.2

Try different perplexity values:

prep.vals <- c(2,5,10,25,50,100)
calc.tsne <- function(seed) {
  dat.tsne <- list()
  set.seed(seed)
  for (i in 1:length(prep.vals)) {
    dat.tsne[[i]] <- Rtsne(dat$train[,-1], dims=2, perplexity=prep.vals[i])$Y
  }
  dat.tsne
}
dat.tsne <- calc.tsne(seed=10)
plot.tsne <- function() {
  p <- list()
  for(i in 1:length(prep.vals)) {
    p[[i]] <- ggplot(as.data.frame(dat.tsne[[i]]), aes(V1, V2, color=dat$train[,1])) +
      geom_point(size=.1) + coord_fixed() + guides(color=guide_legend(title="digit")) +
      ggtitle(paste('perplexity =', prep.vals[i])) +
      theme(legend.position="none", plot.title=element_text(size=10), axis.text.x=element_text(size=8),
            axis.text.y=element_text(size=8), axis.title=element_blank())
  }
  do.call(grid.arrange, c(p, ncol=3)) 
}
plot.tsne()

The results are sensitive to perplexity.
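A short sketch of what perplexity controls: tSNE fits a conditional distribution over neighbors for each point and tunes its Gaussian bandwidth so that 2 raised to the distribution's entropy matches the requested perplexity. A uniform distribution over m neighbors has perplexity exactly m, so perplexity acts as an effective neighbor count.

```r
# Perplexity of a discrete distribution: 2 to the Shannon entropy.
perplexity <- function(p) 2^(-sum(p * log2(p)))
perplexity(rep(1/25, 25))          # uniform over 25 neighbors -> 25
perplexity(c(0.9, rep(0.1/9, 9)))  # concentrated -> far fewer effective neighbors
```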

3.3

Using different seeds gives different embeddings, since the tSNE objective is non-convex and the optimization is not guaranteed to find the global optimum.

dat.tsne <- calc.tsne(seed=100)
plot.tsne()

dat.tsne <- calc.tsne(seed=1000)
plot.tsne()

Problem 4

4.1

Apply KNN with one nearest neighbor to the test data:

library(class)
y.test.knn <- knn(dat$train[,-1], dat$test[,-1], dat$train[,1], k=1, prob=F)

4.2

Confusion matrix for the test data:

table(y.test.knn, dat$test[,1])
##           
## y.test.knn   1   3   5
##          1 261   0   1
##          3   2 157   5
##          5   1   9 154

Misclassification rate for the test data:

mean(y.test.knn != dat$test[,1])
## [1] 0.03050847

4.3

Apply KNN with different k values (only odd values are included, since even k would produce many more ties):

set.seed(10)
k.vals <- seq(1, 21, by=2)
misrates <- sapply(k.vals, function(k)
  mean(knn(dat$train[,-1], dat$test[,-1], dat$train[,1], k=k, prob=FALSE) != dat$test[,1]))

Plot the error rate against k:

qplot(k.vals, misrates, geom="line") + labs(x="k", y="error rate")

k = 1, 5, and 7 all yield the same best error rate.
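As a hedged alternative (toy data standing in for dat$train), k could be chosen without touching the test set at all, via leave-one-out cross-validation on the training data with class::knn.cv:

```r
library(class)
set.seed(10)
# Toy two-class training set; the real code would use dat$train.
x.tr <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
              matrix(rnorm(100, mean = 3), ncol = 2))
y.tr <- factor(rep(c(1, 3), each = 50))
k.vals <- seq(1, 21, by = 2)
# knn.cv classifies each training point using all the other points.
cv.err <- sapply(k.vals, function(k) mean(knn.cv(x.tr, y.tr, k = k) != y.tr))
k.vals[which.min(cv.err)]   # candidate k with the lowest LOOCV error
```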

4.4

Function for plotting the digit data as an image:

conv.image <- function(vec) {
  # Reshape the 256-length pixel vector into a 16 x 16 matrix,
  # flip it vertically, and draw it as a grayscale image.
  mat <- matrix(as.numeric(vec), nrow=16, ncol=16)
  mat <- -mat[, 16:1]
  par(mar=c(0, 0, 0, 0))
  image(mat, col=gray(seq(0, 1, 0.01)), xaxt='n', yaxt='n')
}

Here are digits that are actually 3’s but are misclassified as 5’s (k=1). Only the first one is plotted here; the rest are saved to PDF files.

mis.three <- intersect(which(y.test.knn==5), which(dat$test[,1]==3))
conv.image(dat$test[mis.three[1], -1])

for (i in 1:length(mis.three)) {
  pdf(paste0('mis.three.knn.', i, '.pdf'))
  conv.image(dat$test[mis.three[i], -1])
  dev.off()
}

Digits that are actually 5’s but are misclassified as 3’s:

mis.five <- intersect(which(y.test.knn==3), which(dat$test[,1]==5))
conv.image(dat$test[mis.five[1], -1])

for (i in 1:length(mis.five)) {
  pdf(paste0('mis.five.knn.', i, '.pdf'))
  conv.image(dat$test[mis.five[i], -1])
  dev.off()
}

Problem 5

5.1

The QDA classifier stops with an error, since with 256 predictors the per-class covariance matrices cannot be estimated at full rank:

dat.qda <- qda(V1 ~ ., data=dat$train)
## Error in qda.default(x, grouping, ...): rank deficiency in group 1
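A hedged toy reproduction of the failure mode: qda estimates a separate covariance matrix per class, which is rank deficient whenever a class has fewer observations than predictors, as with the 256 pixel features here.

```r
# Toy data: 20 predictors but only 5 observations per class, so each
# per-class covariance estimate is necessarily rank deficient and
# qda refuses to fit, just as it does on the digit data.
library(MASS)
set.seed(1)
x.toy <- matrix(rnorm(10 * 20), nrow = 10)    # 10 observations, 20 predictors
y.toy <- factor(rep(c("a", "b"), each = 5))   # 5 per class, fewer than 20
res <- tryCatch(qda(x.toy, y.toy), error = function(e) conditionMessage(e))
res   # rank-deficiency error message
```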

5.2

Apply the LDA classifier and test it on the test data:

dat.lda <- lda(V1 ~ ., data=dat$train)
y.test.lda <- predict(dat.lda, dat$test)$class

Confusion matrix for the test data:

table(y.test.lda, dat$test[,1])
##           
## y.test.lda   1   3   5
##          1 259   0   1
##          3   4 151  11
##          5   1  15 148

Misclassification rate for the test data:

mean(y.test.lda != dat$test[,1])
## [1] 0.05423729

5.3

Digits that are actually 3’s but are misclassified as 5’s:

mis.three <- intersect(which(y.test.lda==5), which(dat$test[,1]==3))
conv.image(dat$test[mis.three[1], -1])

for (i in 1:length(mis.three)) {
  pdf(paste0('mis.three.lda.', i, '.pdf'))
  conv.image(dat$test[mis.three[i], -1])
  dev.off()
}

Digits that are actually 5’s but are misclassified as 3’s:

mis.five <- intersect(which(y.test.lda==3), which(dat$test[,1]==5))
conv.image(dat$test[mis.five[1], -1])

for (i in 1:length(mis.five)) {
  pdf(paste0('mis.five.lda.', i, '.pdf'))
  conv.image(dat$test[mis.five[i], -1])
  dev.off()
}

5.4

Apply High-Dimensional Regularized Discriminant Analysis (HDRDA), using the train function in the caret package to tune the parameters. The method has three tunable hyperparameters: lambda (the pooling parameter, which shifts each class covariance matrix between the separate and the pooled covariance), gamma (the shrinkage parameter, which shifts the covariance matrix toward or away from a diagonal matrix), and shrinkage_type (the type of covariance-matrix shrinkage, ridge or convex).
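These parameters generalize Friedman's classical RDA regularization; HDRDA's exact formulas differ, but a minimal sketch of the classical form (with a hypothetical helper rda.cov, illustrative only) shows the roles the two parameters play:

```r
# Classical RDA-style regularization (illustrative sketch, not HDRDA):
# lambda blends the class covariance with the pooled covariance,
# gamma then shrinks the blend toward a scaled identity matrix.
rda.cov <- function(S.k, S.pool, lambda, gamma) {
  S <- (1 - lambda) * S.k + lambda * S.pool
  (1 - gamma) * S + gamma * mean(diag(S)) * diag(nrow(S))
}
set.seed(1)
x1 <- matrix(rnorm(40), ncol = 2)
x2 <- matrix(rnorm(40), ncol = 2)
S.pool <- (cov(x1) + cov(x2)) / 2
S.reg <- rda.cov(cov(x1), S.pool, lambda = 0.5, gamma = 0.5)
S.reg   # regularized class-1 covariance estimate
```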

library(caret)
## Loading required package: lattice
set.seed(1234)
train.control <- trainControl(method="repeatedcv", number=5, repeats=5, search="random")
dat.hdrda <- train(V1 ~ ., data=dat$train, method="hdrda", trControl=train.control)
y.test.hdrda <- predict(dat.hdrda, dat$test)

The results of the tuning process:

dat.hdrda
## High-Dimensional Regularized Discriminant Analysis 
## 
## 2219 samples
##  256 predictor
##    3 classes: '1', '3', '5' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 1775, 1776, 1775, 1775, 1775, 1774, ... 
## Resampling results across tuning parameters:
## 
##   gamma      lambda       shrinkage_type  Accuracy   Kappa    
##   0.6092747  0.640310605  convex          0.9854015  0.9773447
##   0.6222994  0.860915384  ridge           0.9845002  0.9759442
##   0.6233794  0.009495756  convex          0.9912586  0.9864361
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were gamma = 0.6233794, lambda
##  = 0.009495756 and shrinkage_type = convex.

Confusion matrix for the test data:

table(y.test.hdrda, dat$test[,1])
##             
## y.test.hdrda   1   3   5
##            1 261   0   0
##            3   2 154   2
##            5   1  12 158

Misclassification rate for the test data:

mean(y.test.hdrda != dat$test[,1])
## [1] 0.02881356